{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "5163ad8d-faf3-4754-9d48-38672bc20afd",
   "metadata": {},
   "source": [
    "# NV-Ingest: CLI Client Quick Start Guide\n",
    "\n",
    "This notebook provides a quick start guide to using the NV-Ingest client to interact with a running NV-Ingest cluster. It will walk through the following:\n",
    "\n",
    "- Explore the CLI client help utility\n",
    "- Submit a single file NV-Ingest job with the CLI client\n",
    "- Submit a batch NV-Ingest job with the CLI client\n",
    "- View NV-Ingest job outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e0a6279-e78f-412a-94e7-a64ce5c0e4df",
   "metadata": {},
   "source": [
    "Specify a few notional files for testing and parameters to connect with a running NV-Ingest cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "daf6a66e-1e8f-4d7c-ac70-91b2c79a9951",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# sample input file and output directories\n",
    "SAMPLE_PDF0 = \"/workspace/nv-ingest/data/multimodal_test.pdf\"\n",
    "os.environ[\"SAMPLE_PDF0\"] = SAMPLE_PDF0\n",
    "SAMPLE_PDF1 = \"/workspace/nv-ingest/data/functional_validation.pdf\"\n",
    "BATCH_FILE = \"/workspace/client_examples/examples/dataset.json\"\n",
    "os.environ[\"BATCH_FILE\"] = BATCH_FILE\n",
    "OUTPUT_DIRECTORY_SINGLE = \"/workspace/client_examples/examples/processed_docs_single\"\n",
    "OUTPUT_DIRECTORY_BATCH = \"/workspace/client_examples/examples/processed_docs_batch\"\n",
    "os.environ[\"OUTPUT_DIRECTORY_SINGLE\"] = OUTPUT_DIRECTORY_SINGLE\n",
    "os.environ[\"OUTPUT_DIRECTORY_BATCH\"] = OUTPUT_DIRECTORY_BATCH"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c5d7f9b6-9942-4ec6-b11c-aa2fb1a3855d",
   "metadata": {},
   "source": [
    "## The NV-Ingest CLI Client"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69547089-554c-4643-a1b0-635655ac56d7",
   "metadata": {},
   "source": [
    "This section will illustrate usage of the `nv-ingest-cli` client to submit ingest jobs to an up and running NV-Ingest cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "71c88c9b-be77-4a65-b8c9-75241db1678c",
   "metadata": {},
   "source": [
    "### Help Utility"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40d7d46b-0000-42a8-9e83-058f671feb90",
   "metadata": {},
   "source": [
    "The CLI help utility will provide a description of settings and arguments that can be used to configure ingest jobs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76a265f3-bcaa-4436-9338-9d6041c0d86e",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nv-ingest-cli --help"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c24c2e13-8f2b-484f-843b-d908f7ecea2e",
   "metadata": {},
   "source": [
    "### Submitting a Single File Job"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "715fed85-ccb1-4c32-afa3-d5de4e19a6c1",
   "metadata": {},
   "source": [
    "This section will demonstrate a CLI example that submits a single file extraction oriented NV-Ingest job and save outputs locally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a4080c2-a245-46ee-9a1b-9e8281f7e6f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nv-ingest-cli \\\n",
    "  --doc ${SAMPLE_PDF0} \\\n",
    "  --task='extract:{\"document_type\": \"pdf\", \"extract_method\": \"pdfium\", \"extract_text\": true, \"extract_images\": true, \"extract_tables\": true, \"extract_tables_method\": \"yolox\"}' \\\n",
    "  --task='dedup:{\"content_type\": \"image\", \"filter\": true}' \\\n",
    "  --task='filter:{\"content_type\": \"image\", \"min_size\": 128, \"max_aspect_ratio\": 5.0, \"min_aspect_ratio\": 0.2, \"filter\": true}' \\\n",
    "  --client_host=${REDIS_HOST} \\\n",
    "  --client_port=${REDIS_PORT} \\\n",
    "  --output_directory=${OUTPUT_DIRECTORY_SINGLE}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7cac54a-8dc5-44a5-be01-64bf3c9c930b",
   "metadata": {},
   "source": [
    "The outputs will be saved locally for usage after job completion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3eeb6d26-3935-423f-91cd-aa6dea774133",
   "metadata": {},
   "outputs": [],
   "source": [
    "!tree processed_docs_single"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5fbf0c3-1172-4967-979a-2660a932c1ea",
   "metadata": {},
   "source": [
    "### Submitting a Batch Job"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dd642383-a031-41fa-aca8-011b16837b0a",
   "metadata": {},
   "source": [
    "Alternatively, a batch job can be submitted using on a json file that includes list documents to be ingested. This json file will need to include the following keys:\n",
    "\n",
    "- `sampled_files` - A list of paths to files for the ingest job.\n",
    "- `metadata` - Requires a `file_type_proportions` key with a sub-array that can be empty (this requirement may be deprecated in future versions).\n",
    "\n",
    "All files included have the same ingest task configuration applied to them."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5dcab38f-98f4-494a-bff7-03348e46c420",
   "metadata": {},
   "source": [
    "Create a notional json file to demonstrate usage of the CLI batch job configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85abcd01-b011-4eef-9a26-dbb34b35f91f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "batch_files = {\"sampled_files\": [SAMPLE_PDF0, SAMPLE_PDF1], \"metadata\": {\"file_type_proportions\": {}}}\n",
    "\n",
    "with open(BATCH_FILE, \"w\") as f:\n",
    "    json.dump(batch_files, f, indent=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28c47db5-0564-4ebc-9537-f7dfb88e97d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat $BATCH_FILE"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fa7f839-0382-4905-804c-485163ed696d",
   "metadata": {},
   "source": [
    "The results of this job will be stored locally in the same file hirearchy. The names of each file will map to the file name in the dataset file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d2ee968-66ec-4d24-af53-5014e61b32ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "!nv-ingest-cli \\\n",
    "  --dataset ${BATCH_FILE} \\\n",
    "  --task='extract:{\"document_type\": \"pdf\", \"extract_method\": \"pdfium\", \"extract_text\": true, \"extract_images\": true, \"extract_tables\": true, \"extract_tables_method\": \"yolox\"}' \\\n",
    "  --task='dedup:{\"content_type\": \"image\", \"filter\": true}' \\\n",
    "  --task='filter:{\"content_type\": \"image\", \"min_size\": 128, \"max_aspect_ratio\": 5.0, \"min_aspect_ratio\": 0.2, \"filter\": true}' \\\n",
    "  --client_host=${REDIS_HOST} \\\n",
    "  --client_port=${REDIS_PORT} \\\n",
    "  --output_directory=${OUTPUT_DIRECTORY_BATCH}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c7e8bd9-9882-4acc-bd01-9dc53328e746",
   "metadata": {},
   "source": [
    "The outputs will be saved locally for usage after job completion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33fb01e7-c08e-4505-960d-85fabff1d161",
   "metadata": {},
   "outputs": [],
   "source": [
    "!tree processed_docs_batch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa5f0ec1-42c8-422d-a599-2e88618f958a",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
